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Abstract 

Witli exome sequencing becoming a tool for mutation detection in routine diagnostics tliere is an increasing need 
for platform-independent methods of quality control. We present a genotype-weighted metric that allows 
comparison of all the variant calls of an exome to a high-quality reference dataset of an ethnically matched 
population. The exome-wide genotyping accuracy is estimated from the distance to this reference set, and does 
not require any further knowledge about data generation or the bioinformatics involved. The distances of our 
metric are visualized by non-metric multidimensional scaling and serve as an intuitive, standardizable score for the 
quality assessment of exome data. 



Background 

In recent years, next-generation sequencing (NGS)- 
based exome screens have become an invaluable tool in 
Mendelian disease gene discovery and are now being 
introduced as clinical diagnostic tools for genetic disor- 
ders of high phenotypic and genetic heterogeneity [1,2], 
Various solutions for exome enrichment and sequencing 
exist and numerous algorithms for sequence read map- 
ping and variant detection are in use [3-12]. There are 
recommendations for sequencing depth and benchmarks 
for the distribution of sequence read coverage over the 
target region. The common core of the diverse 
approaches to sequence exomes represents the consen- 
sus coding sequences as defined by the consensus cod- 
ing sequence (CCDS) project [13]. The majority of 
publications [12] related to this field seem to confirm 
that a mean sequencing depth of this target region with 
high quality short sequence reads should be above 50- 
fold and more than 90% of the CCDS exons should be 
covered by at least 10 sequence reads for diagnostic pur- 
poses. Another recently introduced parameter for the 
quality assessment of multiple read alignments is the 
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variance of the ratio of reads that support the alternate 
allele at heterozygous positions [14]. The lower this var- 
iance, the lower the error rate to be expected from 
amplification artifacts. The ratio of transitions versus 
transversions (ti/tv) and the proportion of variants that 
are already listed in databases of genetic variation such 
as the Single Nucleotide Polymorphism database 
(dbSNP) [15] are measures of quality that may be 
applied to the entire variant call set of an exome. The 
ti/tv ratio should be close to 3:1 for the CCDS exons, 
and the proportion of singletons should be below 10% 
[16]. However, the ti/tv ratio is influenced by the target 
region, whereas the number of novel variants may 
depend on the background population. The higher the 
amount of non-coding variants, the lower the ti/tv ratio, 
and higher proportions of novel variants may be 
observed if the sequenced individual is from a popula- 
tion that is poorly represented in the variant databases. 

Although these parameters may serve as valuable indi- 
cators for quality they do not directly indicate the accu- 
racy of a sequenced exome and to our knowledge there is 
no criteria for assessing whether the variants identified 
by whole exome sequencing represent a comprehensive 
list. Specifically, it is not possible to estimate an exome- 
wide false positive or negative rate for variant detection 
that is purely based on the quality scores of genotype 
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calls. Sequencing technology-specific error signatures 
[17] can yield artificial variant calls of erroneous high 
quality and result in an underestimated proportion of 
false calls, while poorly adjusted bioinformatics pipelines 
for data processing may lead to missed calls. In most 
NGS studies, a Phred-like quality score is provided for 
each called genotype. This score describes the confidence 
in a genotype call. Based on a certain likelihood model 
for genotypes, the Phred score represents the probability 
that the genotype call is wrong, given the reads in an 
alignment (Phred score = - 10log^QPr{wrong genotype]]). 
In a model that assumes, for example, a Bernoulli distri- 
bution of the sequence reads at a heterozygous position, 
the Phred score of a heterozygous genotype would 
decrease the more the ratio of reads supporting the alter- 
nate allele deviates from the expected value of 0.5. How- 
ever, this quality score not only depends on the raw data 
but also on the mapping algorithms and probability mod- 
els that were used for variant calling. That means that 
processing the same raw data by different bioinformatics 
pipelines may result in different distributions of quality 
scores, suggesting different genotyping error profiles for 
the same exome. Even variant calling approaches that are 
based on similar Bayesian methods do not yield the same 
genotype probabilities due to different priors [18], and 
methods of quality score recalibration cannot completely 
adjust for that effect (Additional file 1, Table SI). 

In order to enable interoperability and platform inde- 
pendence, we have developed a method to measure the 
accuracy of a set of variants by assessing its composi- 
tion. In our metric, the distance between sets of variants 
from two exome samples is computed without consider- 
ing genotyping quality scores. The basic idea is that the 
distance between variant sets of comparable quality is 
closer than the distance between variant sets of very dif- 
ferent quality. In our comparison, the variant data of 
individuals of the 1000 genomes project [19] serve as a 
gold standard and we refer to them as the reference set. 
If the genotype concordance between the reference var- 
iant call set and the test variant call set is high, this sug- 
gests a comparable sequencing quality. We will show in 
the following that the distance of a test sample to this 
high quality reference data is an indicator for the geno- 
typing accuracy of the exome. 

Methods 

Generation of exome data, accession of reference data 
and data processing 

Genomic DNA of European individuals was enriched 
for the target region of all human CCDS exons with 
SureSelect Human All Exon Kit (Agilent, Santa Clara, 
USA) according to the manufacturer's protocol and 
sequenced on a HiSeq 2000 (lUumina, San Diego, USA), 



yielding more than 5 gigabases raw sequence data per 
exome. The Charite University Medicine ethics board 
approved this study, which conforms to the Helsinki 
Declaration, and we obtained informed consent of all 
participants. 

Publicly available NGS raw data and variant calls of 
1,063 individuals of different populations were down- 
loaded from the ftp server of the 1000 genomes project 
[19]. Exome variants of these individuals served as refer- 
ence variant sets in our work. Exome variants of the 
5000 exomes project and of de Ligt et al. were used for 
testing the accuracy predictions [20,21]. 

Exomes of test samples were enriched with Human All 
Exon SureSelect baits from Agilent and sequenced on 
lUumina Genome Analyzer IIx and lUumina HiSeq 2000 
as 100 bp single-end reads or paired-end reads according 
to the manufacturers' protocols. Short sequence reads 
were mapped by Novoalign (Novocraft, version 2.08) or 
BWA[22] to the reference sequence GRCh37. Variants 
were detected with default settings with SAMtools [23] 
or GATK [10] on bam-formatted alignments [22]. Var- 
iant calls in variant call format (vcf) [24] were restricted 
to single nucleotide changes and to the consensus exome 
target region of the 1000 genomes project. Additionally, 
sites that were classified as technical artifacts by the 1000 
genomes project were ignored. 

Distance function 

The distance djj between any two samples and Xj 
for all positions Jlf in the target region (exome), where 
the called genotypes differ from the reference sequence 
in at least one sample, can be calculated by a weighted 

indicator function / (Xy [k] , Xj [k)) W^j [/f J, with: 

uk] = I (X, m .xAk)) = l ^i^^ ^ ^ ' 

This means that for the same genotypes the indicator 
/ is weighted by the reciprocal of the genotype fre- 
quency f [X^ {k)), which is based on the reference set 
with an appropriate background population. To give an 
example, a genotype for individual / at a given position 
x^ik = chr6 : 79595096) = C/C, x^ik = chr6 : 795950963 = C/C, 
would refer to a genotype frequency 
f [Xy[if3) = 0399, if 1 out of 1,000 individuals in 
the reference set differs from this genotype. 

For genotypes that were present only in the test sam- 
ple but not observed at all in the reference set, we set 
their frequency tol/fn + 1], where ft is the total 
number of individuals in the reference set. 
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Based on that the distance is defined as: 

where Cjj = is used as a normaUzing 

constant. 

Therefore, a disagreement at a position of low varia- 
biUty in the reference set contributes more to the total 
distance than one at a highly variable position. 

In the resulting distance matrix, 0, pairs of indivi- 
duals who are 'closely related' can be distinguished from 
those who are distinctly apart by lower distance values. 

Thus, a distance d^j — 0 means total agreement of all 

genotypes and a distance value of = 1 means total 
disagreement of all genotypes. 



Visualization of distance matrices by non-metric 
multidimensional scaling 

The output of the above-described pairwise comparison 
of variant sets is a high-dimensional distance matrix 
with given distances or dissimilarities between pairs of 
individuals that satisfy all conditions of a metric. To 
represent the dissimilarities as distances between points 
in a low-dimensional space, we used a statistical techni- 
que named non-metric multidimensional scaling (MDS), 
that is, a visualization method such as principal compo- 
nent analysis or metric MDS. However, in contrast to 
principal component analysis (PCA) and metric MDS, 
non-metric MDS does not make any assumptions about 
the distribution of the underlying high-dimensional data. 
With a pre-specified number of dimensions for the 
embedding 0 and an appropriate initial configuration, 
the isoMDS function of the MASS R-package was used 
to minimize the goodness of fit, called stress S, of Krus- 
kal and Shepard (see [25]). To promote readability and 
an easy interpretation of the data, we chose a standard 
two-dimensional embedding with: 



where 1 1 ■ 1 1 defines the Euclidean norm. 

Down sampling of raw data and simulation of 
genotyping accuracy 

For coverage-adjusted comparisons, we randomly 
removed sequence reads from the original alignments. 
Variants were recalled on these down-sampled exomes 
as described above. As genotyping accuracy we define 
the percentage of the entire exome that was correctly 



genotyped, that is the sum of true positive genotype 
calls (alternate and reference genotypes) divided by the 
entire size of the exomic target region. For our simula- 
tions, we assumed that the reference set had a genotyp- 
ing accuracy of 100% and introduced genotyping errors 
at random positions. As most of the exomic positions 
had low variability in the reference set, the contribution 
of genotyping errors to the distance function could be 
approximated by adding twice a binomial distributed 
random variable, X -■■ B {N^ p) ^- 2, to the normaliz- 
ing constant Cjj, with probability p equaling the speci- 
fied genotyping error and the number of trials 
N — Z-8 *: 10^ totsl size of the exome,. 

Computation of the standardized dissimilarity score and 
reference curve 

Distances between all individuals of the reference set were 
measured and the averaged values of the median and 
interquartile range of all columns of the distance matrix 
were computed to standardize the median of a test sample. 
The median of the distances from a test sample to all indi- 
viduals of the reference set was computed and normalized 
by subtracting the pre- calculated median of the reference 
set and dividing the interquartile range (IQR) of the refer- 
ence set. The reference curve and both 5% and 95% quar- 
tiles for the standardized dissimilarity score (SDS) were 
computed for the reference set and simulated data sets of 
decreasing error groups. 

Results 

An error sensitive genotype-weighted metric 

Like any metric, the distance measure that we used to 
compare different sequences of a set of test samples 
induces a topology. Variant calls, which describe the mea- 
surable differences between samples, represent true genetic 
variation, as well as genotyping errors. The subject of our 
work is the quality assessment of a set of exome genotypes, 
thus our metric needs to induce a topology that is sensitive 
to sequencing errors while being robust to true genetic dif- 
ferences. By using a weighting method for genotypes that 
uses their frequency of occurrence, we achieved higher pre- 
cision in accuracy prediction compared to an unweighted 
hamming distance (Figure SI in Additional file 1). If two 
samples are not the same at an exonic position, which is 
highly constrained in the population, this contributes more 
to the total distance, because such an event has a higher 
probability of being a genotyping error than divergent gen- 
otypes at a highly variable site of the exome. When we 
compare two exomes, our metric works on all genotypes 
that have been called in these samples. The genotypes are 
weighted by the degree to which the genotype is con- 
strained in the population. Though several definitions for 
measuring the degree of genotype conservation have been 
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suggested [26,27], most variable positions in the human 
genome are biallelic and for simplicity we approximate the 
conservation of a genotype by the inverse of its frequency. 
Thus the differences in two variant sets are weighted by 
their respective genotype frequencies, and the detection of 
many rare variants in a test sample therefore points to a 
higher proportion of false positive genotype calls. By con- 
trast, if many variants that are common in the population 
are not detected in a test sample, this points to a high false 
negative error rate. By this means, we compute a matrix 
that contains the distances of the test sample to all the 
samples of a reference set as well as their mutual distances. 
These distances are a result of a function that works on the 
entire exomic target region, as defined by the 1000 gen- 
omes project, and may therefore be viewed without distor- 
tion only in the multivariate, exomic space. We have 
implemented and tested our method using whole exome 
data, but it could be applied to other types of high- 
throughput sequencing data. However, the precision of 
predicting the accuracy of genotyping decreases for smaller 
target regions (Figure S2 in Additional file 1). 

Non-metric multidimensional scaling is best suited for 
distance visualization 

Because distance is based on multiple variables of 
weighted categorical data, visualization in a plane 
requires a transformation. We tested several standard 
techniques of data visualization and found that non- 
metric MDS [28] showed the best characteristics in con- 
veying the differences in genotyping accuracy in two 
dimensions (Figure S3 in Additional file 1). We there- 
fore project the exomic distance matrix into two artifi- 
cial dimensions of Ol and 02 that have the smallest 
loss of information [29-31]. 

The reference samples of the 1000 genomes project 
form clusters according to their ethnicity (Figure S4 in 
Additional file 1) and for any test sample we chose 
the closest cluster as the best matching reference set. 
Samples from the same population background form 
homogeneous clusters in non-metric MDS scaling, indi- 
cating a comparable genotyping quality (Figure lA). 

We then analyzed two test samples of European des- 
cent but of unknown genotyping accuracy and included 
them into the MDS projection of all central European 
(CEU) individuals from the 1000 genomes project, which 
is shown in Figure IB. Except for one representative 
recalled sample, NA06986, all individuals of the CEU 
reference set are displayed as black circles, whereas the 
two exome test samples are represented by the colored 
triangles. Although the mean sequence coverage of the 
exome target region is above 60x for both test samples, 
they clearly differ in their mean distance to samples from 
the reference set: the second sample is close to the clus- 
ter formed by the individuals of the reference set. 



whereas the first sample is an outlier, indicating inferior 
quality. This considerable difference in the mean distance 
is remarkable given the high sequencing depth and a 
comparable ti/tv ratio of 3:2 (Additional file 1, Table SI). 
Also the proportion of variants found in dbSNP is around 
97%, which is comparable to NA06986. Only the variance 
of the heterozygous allele frequencies, which increases 
with a growing number of artifacts from the amplification 
steps during the library preparation, suggests a lower 
quality for sample 1 with var(het AF) = 0.017 compared 
to 0.012 in test sample 2 and 0.013 in NA06986 at the 
same mean coverage [14]. 

Visualization of exomes of different genotyping accuracy 

We measured the mean distance to the reference set for 
85 exomes of European descent that were all analyzed 
in the context of NGS research projects. The two sam- 
ples displayed in Figure IB illustrate the extreme spec- 
trum of the mean distance to the reference set that we 
encountered. We hypothesized that variants often 
detected in exomes of the 1000 genomes project, but 
not in the outliers of our test samples, might point to a 
subset that requires high data quality to be properly 
detected and that might explain partly the separation 
from the reference cluster. Figure 2A displays the value 
pairs of variant allele frequencies that are based on 85 
CEU individuals from the 1000 genomes project and 85 
exome test samples. For technical replicates one would 
expect all allele frequency value pairs to lie close to the 
diagonal in this kind of visualization. For two sample 
sets of equal size that are drawn from the same popula- 
tion, one would expect a certain degree of variance in 
the measured allele frequency that is simply based on 
the finite sample size: given the allele frequencies one 
would expect for a sample size of 85 that about 95% of 
the frequency value pairs fall inside the boundaries of 
the displayed ellipse based on a Bernoulli distribution. 
However, in our case there are considerably more out- 
liers than expected by chance. 

Characteristics of sequence variants with high error rates 

We looked for similarities of these outliers and com- 
puted the GC content of 100 bases flanking the variant 
alleles that were present in more than half of the indivi- 
duals analyzed in the 1000 genomes project but in only 
one or less of our analyzed samples. The distribution of 
the GC content of these variants clearly deviates from 
the distribution that one would expect for randomly 
located variants in the exome (Figure 2B). 

To investigate the reasons for the higher false negative 
error rate for variants in a GC-rich sequence context, 
we computed the mean read coverage of the target 
region depending on the GC-content. Figure 2C shows 
that the distribution of the coverage is smaller for test 
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Figure 1 Similarities of exome samples. (A) The similarity between exome samples is measured with a genotype frequency weighted metric. 
The similarity matrix is visualized by non-metric multidimensional scaling in the two-dimensional plane. Variant sets of individuals of European 
descent that were analyzed by the 1000 genomes project form a homogeneous cluster, irrespective of the applied sequencing technology. (B) 
Two test samples of the same ethnicity that we genotyped according to a standard exome protocol are compared to samples of the reference 
set. The larger distance of sample 1 to the reference cluster indicates a lower genotyping accuracy, while sample 2 points to a high quality. 
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Figure 2 Analysis of low quality samples. (A) Comparison of variant allele frequencies in different sample sets. Value pairs of genotype 
frequencies were computed for exomes of the reference set (CEUs from the 1000 genomes project) and compared to test samples of the same 
ethnicity that are low quality. The ellipse indicates twice the standard deviation assuming a binomial model for the allele frequency p. Variants in 
the right lower quadrant were called with a lower probability in our test samples and are characterized by a GC content that deviates from the 
expected mean. (B) GC content at false negative positions. Variants that are underrepresented in exomes with a large distance to the high 
quality reference set are overrepresented in exome regions with high GC content (violet curve). The green curve indicates the distribution of the 
GC-content that is expected for an equal number of variants that are randomly drawn from the exome. (C) Coverage against GC content. The 
mean sequence coverage of the consensus exome varies with the GC content of the target region. The overall coverage for an exemplary 
sample from the 1000 genomes project (NA06986) was higher compared to test samples 1 and 2. Test sample 1 has a particularly low coverage 
in regions with extreme GC content, suggesting a higher error rate. 
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sample 1 compared to test sample 2 and NA06986 that 
was sampled down to a comparable mean coverage. 
Thus, the higher distance of test sample 1 to the refer- 
ence set is partly due to a critically low sequence read 
coverage of regions with an extreme GC-content. Benja- 
mini et aL studied the bias caused by GC content 
depending on read coverage in detail for lUumina 
sequencing data and showed that it also varies between 
technical replicates [32]. This also means that two sam- 
ples may have different genotyping accuracy for the 
whole exome although they have been processed by the 
same protocol. 

The distance to the reference set versus coverage and 
error rates 

We hypothesized that the distance to the high quality 
reference set should grow when the amount of raw 



sequence data decreases. We therefore successively 
reduced the sequence coverage in the raw exome data of 
NA06986, called variants anew, and observed an increas- 
ing distance to the reference set (green to blue circle in 
Figure 3). A decreasing sequencing coverage will not only 
yield an increasing rate of false negative genotypes but 
also increase the rate of false positive calls. It is more 
likely that a sequencing error will be called as a heterozy- 
gous variant, and a heterozygous variant as homozygous, 
if the position is only covered by a few reads. We then 
analyzed how an increasing false positive error affects the 
distance to the reference set by simulating detection arti- 
facts that were randomly distributed over the target 
region and added to a sample from the reference set. The 
triangles in Figure 3 show that the data points follow a 
trajectory that departs from the reference cluster with 
growing error rate. It has to be noted that the 
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Figure 3 Distance to the reference set for decreasing coverage and increasing error rate. A reduction of the raw sequence amount for a 
randomly chosen sample of IKG project, NA06986, reduces the similarity to the reference set (green-blue circles) indicating decreasing 
genotyping accuracy. A similar effect is observed when simulated genotyping errors are added to the variant calls (red-yellow triangles). 
A decreasing similarity of the test samples results in divergent trajectories. 
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visualization of multiple versions of the same test sample 
that differ only in coverage or error rates in a single MDS 
plot contorts the relative distances. All the depicted 
simulated data points in Figure 3 originate from the same 
data set and are therefore self-similar. The self-similarity 
is high for a low error rate and high coverage and 
decreases with increasing error rate and decreasing cov- 
erage, as the divergent trajectories of circles and triangles 
indicate. We also obtained similar results in the analysis 
of simulated data sets of other ethnic backgrounds (Fig- 
ure S5 in Additional file 1). 

A genotyping error of 0.00001 corresponds to an 
expectation of one genotyping error in approximately 
100 kb of the target region. Two randomly chosen 
samples from the reference set would differ in about 
100 positions in such a window of 100 kb [21]. The 
samples with the simulated error rate begin to separate 
from the high quality cluster for error rates above 
0.00001, which corresponds to a positive predictive 
value of 0.99 (number of true positive divided by num- 
ber of positive calls). Interestingly, the positive predic- 
tive value that was reported by Tennessen et al, for 
the variant calls of the 5000 exomes project is between 
0.97 and 0.98 [21]. The resolution of our visualization 
techniques is therefore sufficient to display these quali- 
tative differences. 



Comparison of exome data from different next- 
generation sequencing studies 

In contrast to the 1000 genomes project, the genotype 
calls from the 5000 exomes project were publicly available 
only in a collapsed form as genotype frequencies for 
European Americans and African Americans and not as 
separate variant sets for each sample. In addition to our 
in-house exome data, we also analyzed the distances to the 
reference cluster for exomes that we simulated based on 
the genotype frequencies from these European Americans 
and 100 exomes that were already studied by de Ligt et al 
[20]. Figure 4A shows the distribution of the SDS, which 
represents a normalized distance of a test sample to the 
reference cluster. The mean SDS for the simulated exomes 
of the 5000 exomes project is comparable to our exome 
data and lower than the SDS of the de Ligt et al exomes. 
The smaller variance of the SDS distribution in the 5000 
exomes samples, which points to a higher self-similarity, is 
due to a simulation process that did not properly represent 
the haplotype substructure of the data. The higher mean 
SDS of the de Ligt et al. data is mainly explained by out- 
dated genotyping algorithms with higher error rates and a 
lower mean coverage of the target region. Figure 4B 
depicts the coverage distribution over the exome and addi- 
tional quality parameters for an exemplary sample from de 
Ligt et al. and our two test samples. 
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Figure 4 Distributions of standardized dissimilarity scores for different exomes (A) 100 samples, including test samples 1 and 2, were 
sequenced on the lllumina platform with mean coverage above 60x. The mean SDS for the lllumina samples is comparable to the mean SDS of 
100 exomes that were simulated based on genotype frequencies from the 5000 exomes project. The variation of these SDS is smaller due to 
missing haplotype information. (B) The mean SDS of 100 exomes that were sequenced on a Solid platform is considerably higher due to a lower 
and less uniform sequence distribution over the target region and due to less accurate variant calls. 
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By simulating increasing sequencing noise for all exome 
data sets of the 1000 genomes, we derived distributions 
for the mean distances of the original data. Based on these 
distributions, we computed a reference curve for the SDS 
of an unknown sample that correlates with its exome-wide 
accuracy (Figure 5). The SDS that is measured for our test 
samples may be used for estimating their genotyping accu- 
racy by intersecting with the reference curve. For test sam- 
ple 1, the mean distance to the reference set was 0.29, 
which corresponds to an estimated genotyping error of 
0.0001. By contrast, the SDS of 0.11 for test sample 2 indi- 
cates an error rate that is much closer to that of the 1000 
genomes project. Interestingly, the distance to the refer- 
ence set shows characteristics of a phase transition, when 
the contribution of genotyping errors exceeds the genetic 
variability between individuals. We also checked the valid- 
ity of our approach by deriving the genotyping accuracy 
via a complementary method. In Heinrich et al, we ana- 
lyzed the effect of amplification steps during sample 



preparation and derived rates of genotyping errors from 
technical repUcation [14]. The SDSs for these replicates 
indicate genotyping accuracy for the exomes between 
99.99% and 99.999%, which is in good agreement with the 
previously computed accuracy. 

Thus, the SDS is a parameter derived from the composi- 
tion of a variant set and is even more powerful in predict- 
ing the data quality than other quality control parameters 
such as coverage distributions, which require access to the 
read alignments (Additional file 1, Table SI). 

We tested the influence of the sequencing platform on 
the error prediction by our approach by restricting the 
reference set to samples that were sequenced with the 
same technology (Illumina). Although the visualization 
in metric MDS clearly shows that the test samples are 
closest to the Illumina samples from the reference set 
(Figure S3 in Additional file 1), the effect of the sequen- 
cing platform on the accuracy prediction is marginal 
(Figure S6 in Additional file 1). The SDS is therefore 
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Figure 5 Estimation of genotyping errors from standardized dissimilarity scores. The reference curve with its 5% and 95% quantiles is 
based on the distances of samples with simulated error rates to the reference set. The SDS of a test sample indicates its error rate by its 
intersection with the reference curve. The estimated error rate of test sample 1 is considerably higher than of test sample 2 and of NA06986 
from the 1000 genomes project. SDS, standardized dissimilarity score. 
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robust and can be applied to predict the quality of geno- 
typing data from different sequencing technologies. We 
used the platform independence of our approach to ana- 
lyze the quality of a sample from the 1000 genomes pro- 
ject, NA12878, that was recently re-sequenced by a new 
technology (Proton, Ion Torrent, aligned with TMAP 
and genotyped with variantCaller). In Figure 6, the dis- 
tance of this Proton variant set of NA12878 to the refer- 
ence set is compared to a variant set of the same 
individual based on lUumina raw data that was down- 
sampled to a mean coverage of 30 fold as described 
above. This figure visualizes how the quality of two 
exomes of the same individual that were generated on 
two different sequencing platforms and processed by dif- 
ferent bioinformatics pipelines may be interpreted at a 
glance. We used the SDS to estimate the genotyping 
accuracy and predicted 99.9% for the low coverage Illu- 
mina exome and above 99.99% for the Proton exome, 
which is in good agreement with the values based on 
Sanger validation for this sample. 

Discussion 

We have described a new approach to assess the accu- 
racy of variant calls from NGS studies. The genotyping 
accuracy for variant calls, that is genotypes that differ 
from the true sample sequence, has been estimated in 
large-scale NGS-based projects such as the 1000 gen- 
omes project [19] or the exome sequencing project [21] 
and comparisons of NGS platforms [33]. In these 
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projects, samples were sequenced to a very high mean 
coverage on different sequencing platforms and the 
reported variants represent an intersection of technical 
replicates and independent analysis pipelines. Even in 
these high quality data sets, up to about 2% of the var- 
iants cannot be vaUdated if re-sequenced by a comple- 
mentary approach such as ABI Sanger. In such a high 
quality exome, one detects around 15,000 single nucleo- 
tide variants per 30 Mb coding sequence and approxi- 
mately 300 of them are likely false positive calls, which 
corresponds to an error rate of 300/30n0^ = 0.00001. 

Based on simulated accuracy groups for variant calls, 
we were able to assess the quality of an exome test sam- 
ple without detailed knowledge of the applied enrich- 
ment and sequencing technology or of the 
bioinformatics pipeline that was used to align the reads 
and call the genotypes. The SDS is therefore suitable for 
a comprehensive quality control in all exome-based 
mutation screens and might turn out especially useful as 
a criterion for data inclusion in studies that combine 
exome data of different sources, due to its platform 
independence. The estimated genotyping error in parti- 
cular might serve as quality criterion before variants 
detected in an exome are further analyzed: only if the 
estimated error is comparable to that of a high quality 
reference set such as the 1000 genomes project would 
one proceed with variant analysis. We envision that our 
approach to estimate the genotyping accuracy of exomes 
will facilitate the quality assessment of NGS data. 
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Figure 6 Comparison of exome data of different NGS platforms. A) Exome genotypes of individual NA12878 were called based on an 
lllumina read data set of low coverage and a Proton dataset. The Proton exome is closer to the CEU reference data than the low coverage 
lllumina exome. B) The accuracies of the exome genotype sets were estimated based on the mean distance to the reference set. 
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Software that computes the SDS, visualizes the dis- 
tances to data of the 1000 genomes project, and predicts 
the genotyping accuracy is available for download and as 
a web service at GeneTalk [19]. 

Web resources 

The URLs for data and methods presented herein are as 
follows: 

NHLBI Exome Sequencing Project (ESP) Exome Var- 
iant Server, http://evs.gs.washington.edu/EVS/ 
GeneTalk, https://gene-talk.de/qc 

ftp server of the 1000 genomes project, ftp:// 
ftp.lOOOgenomes.ebi.ac.uk./voU/ftp/ 

Additional material 



indicating an increasing false positive error rate. The ti/tv ratio and the 
ratio of variants that are present in dbSNP vary only very little with 
changing coverage. Different priors in the genotyping models of 
Samtools and GATK result in different mean genotype quality scores for 
the same alignments. Quality score recalibration with GATK 
VariantRecalibrator performed on Samtools- and GATK-called variants ads 
an adjusted quality score, VQSLOD (log odds ratio of being a true variant 
versus being false under a trained gaussian mixture model). This score is 
used in GATK ApplyRecalibration to generate a tranche file of highly 
confidential calls. The percentage of these calls which passed as high- 
confidential shows an irregular behavior with respect to the mean 
coverage. The SDS, which is computed for the entire variant call set of a 
sample, correlates with its accuracy and allows a sample-to-sample 
comparison. 
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